Spring 2024
How do we identify the effect of a treatment (cause) on an outcome?
Assuming successful randomization to treatment and control, you know it’s the treatment that’s causing the effect.
\(T\) a binary treatment variable
\(Y\) the value of the outcome we observe
\(Y^0\) the value the outcome would take if \(T=0\)
\(Y^1\) the value the outcome would take if \(T=1\)
Let’s think about the last two a bit more carefully…
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2 | 3 | NA | NA |
| Barb | 3 | 4 | NA | NA |
| Catherine | 3 | 4 | NA | NA |
| David | 2 | 3 | NA | NA |
What do these numbers mean?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 3 | 1 | 3 |
| Barb | 3 | NA | 0 | 3 |
| Catherine | NA | 4 | 1 | 4 |
| David | 2 | NA | 0 | 2 |
\[ Y = TY^1+(1-T)Y^0 \]
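The switching equation applied to the table above, as a quick sketch in R (vectors keyed to the four subjects):

```r
# Potential outcomes and assignments from the table above
Y0 <- c(2, 3, 3, 2)      # Andrew, Barb, Catherine, David under control
Y1 <- c(3, 4, 4, 3)      # the same subjects under treatment
T  <- c(1, 0, 1, 0)      # observed treatment assignment
Y  <- T*Y1 + (1 - T)*Y0  # the switching equation
Y                        # 3 3 4 2, matching the observed column
```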
We really care about the difference between \(Y^0\) and \(Y^1\). (Why?)
Let \(\delta_i = y^1_i - y^0_i\)
\(E[\delta]=E[Y^1-Y^0]\)
\(E[\delta]=E[Y^1]-E[Y^0]\)
This is the definition of a treatment effect.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2 | 3 | NA | NA |
| Barb | 3 | 4 | NA | NA |
| Catherine | 3 | 4 | NA | NA |
| David | 2 | 3 | NA | NA |
If we could see this (invisible) world, what would be our calculation of the treatment effect?
Why can’t we make these calculations in real life?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 3 | 1 | 3 |
| Barb | 3 | NA | 0 | 3 |
| Catherine | NA | 4 | 1 | 4 |
| David | 2 | NA | 0 | 2 |
Does this give us the right answer? Why?
\(T \bot Y^0\)
\(T \bot Y^1\)
\(E[Y^0 | T = 0] = E[Y^0 | T = 1 ]\)
\(E[Y^1 | T = 0] = E[Y^1 | T = 1 ]\)
In a properly executed experiment, there is no association between the potential outcome variables and treatment assignment.
\(E[Y^0 | T = 0] \simeq E[Y^0]\)
\(E[Y^1 | T = 1] \simeq E[Y^1]\)
So…
\(E[\delta] = E[Y|T=1]-E[Y|T=0]\)
The difference between the treatment average and the control average
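Applied to the randomized table above, this difference recovers the true effect:

\[
E[Y \mid T=1] - E[Y \mid T=0] = \frac{3+4}{2} - \frac{3+2}{2} = 3.5 - 2.5 = 1,
\]

which matches each subject's true \(\delta_i = 1\).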
\(E[\delta]\) is the expected value (mean) of the difference between each unit’s value of \(Y^1\) and \(Y^0\). It is the average treatment effect (ATE). In a sample, this is the sample average treatment effect (SATE).
Even though the individual differences are unobservable (because either \(Y^0\) or \(Y^1\) will be counterfactual for each unit), we can estimate the mean difference via experiment.
\[\text{SATE} = \frac{1}{n}\sum_{i=1}^{n}(y^1_i - y^0_i)\]
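In the invisible world, where both potential outcomes are visible for the four subjects, this works out to:

\[
\text{SATE} = \frac{1}{4}\big[(3-2) + (4-3) + (4-3) + (3-2)\big] = 1
\]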
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 4000 | 4000 | NA | NA |
| Barb | 2000 | 2000 | NA | NA |
| Catherine | 3000 | 3000 | NA | NA |
| David | 3000 | 3000 | NA | NA |
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 3000 | 1 | 3000 |
| David | 3000 | NA | 0 | 3000 |
Here, the people who go to college have different baseline earnings that have nothing to do with going to college.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 3000 | 1 | 3000 |
| David | 3000 | NA | 0 | 3000 |
\[\hat{\delta}_{naive} = E[Y | T = 1] - E[Y | T = 0]\]
\[\hat{\delta}_{naive} = 3500 - 2500 = 1000\]
Is this a good estimate of the SATE? Why or why not?
The treatment may not have a single effect, but may have different effects for different groups in the population. If the treatment and control groups (would) respond differently to treatment, this can bias the estimate of the effect.
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 2000 | 4000 | NA | NA |
| Barb | 2000 | 2000 | NA | NA |
| Catherine | 2000 | 4000 | NA | NA |
| David | 2000 | 2000 | NA | NA |
What is the effect of a college degree here?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 4000 | 1 | 4000 |
| David | 2000 | NA | 0 | 2000 |
How might we calculate the effect of a college degree here?
Does this give the right answer? Why or why not?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | 3000 | 4000 | NA | NA |
| Barb | 2000 | 3500 | NA | NA |
| Catherine | 3000 | 4000 | NA | NA |
| David | 2000 | 3500 | NA | NA |
What is the effect of a college degree here?
| Subject | Y0 | Y1 | T | Y |
|---|---|---|---|---|
| Andrew | NA | 4000 | 1 | 4000 |
| Barb | 2000 | NA | 0 | 2000 |
| Catherine | NA | 4000 | 1 | 4000 |
| David | 2000 | NA | 0 | 2000 |
How might we calculate the effect of a college degree on earnings here?
Does this give the right answer? Why or why not?
ATE is \(E(Y^1 - Y^0)\) for all units (effect of switching)
ATT is \(E(Y^1 - Y^0)\) for treated units (effect of taking away treatment)
ATC is \(E(Y^1 - Y^0)\) for untreated units (effect of adding treatment)
| Group | \(E(Y^1)\) | \(E(Y^0)\) |
|---|---|---|
| College degree | 1000 | 600 |
| No degree | 800 | 500 |
If 30% of the population has a degree…
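Working these out from the table above (taking the 30% as the share treated):

\[
\begin{aligned}
\text{ATT} &= 1000 - 600 = 400 \\
\text{ATC} &= 800 - 500 = 300 \\
\text{ATE} &= 0.3(400) + 0.7(300) = 330
\end{aligned}
\]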
Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory. American Sociological Review, 86(3), 532–565.
Y: outcome
T: treatment
U: unobserved confounder
S: affects selection into T
X: affects Y directly
Regression attempts to identify \(T \rightarrow Y\) by adjusting for \(X\) while regressing \(Y\) on \(T\)
Matching and weighting attempt to identify \(T \rightarrow Y\) by ensuring that \(S\) has the same distribution for all values of \(T\)
Both are strategies to close the backdoor path between \(T\) and \(Y\)
These techniques only allow us to account for observed differences between treated and control cases. If \(U\) is unobserved, we can’t close the backdoor path with either of these approaches.
set.seed(1234)
library(dplyr) # for tibble(), select(), group_by(), summarize()

# create the data
obs <- 1e6 # 1M observations to minimize randomness
d <- tibble(
  U  = rbinom(obs, 1, .5) ,         # unobserved difference
  S  = rbinom(obs, 1, .25 + .5*U) , # U -> S
  T  = rbinom(obs, 1, .25 + .5*S) , # S -> T
  Y0 = round(2000 + 2000*U + rnorm(obs, 0, 500)) , # untreated outcome
  Y1 = round(3000 + 2000*U + rnorm(obs, 0, 500)) , # treated outcome
  Y  = T*Y1 + (1-T)*Y0) |>          # observed Y
  select(-U, -(Y0:Y1))              # keep observed variables

# mean observed Y within levels of S and T
d |> group_by(S, T) |> summarize(mean_y = mean(Y), .groups = "drop")

| S | T | mean_y |
|---|---|---|
| 0 | 0 | 2499 |
| 0 | 1 | 3498 |
| 1 | 0 | 3499 |
| 1 | 1 | 4500 |
What is the difference within levels of S? Does this give us the right answer? Why?
In an experiment, the treatment and control groups are otherwise the same.
Assumption 1: \(E(Y^1|T=1) = E(Y^1|T=0)\)
Assumption 2: \(E(Y^0|T=1) = E(Y^0|T=0)\)
There exist some observable variables, \(S\), which completely account for the differences between the treatment and control groups.
Assumption 1-S: \(E(Y^1|T=1,S) = E(Y^1|T=0,S)\)
Assumption 2-S: \(E(Y^0|T=1,S) = E(Y^0|T=0,S)\)
If we control for A and B here, we’re not unconfounding the relationship between T and Y. Rather, we’re controlling away part of the true effect of T! This is why we shouldn’t control for (or stratify on) post-treatment variables.
| SES | n | degree | earnings |
|---|---|---|---|
| 1 | 150 | 0 | 2000 |
| 1 | 50 | 1 | 4000 |
| 2 | 100 | 0 | 6000 |
| 2 | 100 | 1 | 8000 |
| 3 | 50 | 0 | 10000 |
| 3 | 150 | 1 | 14000 |
What are the ATE and ATT?
SATE = $2667
SATT = $3000
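These follow from weighting the stratum-specific effects ($2000, $2000, $4000) by the relevant counts — everyone for the SATE, treated units for the SATT:

\[
\text{SATE} = \frac{200(2000) + 200(2000) + 200(4000)}{600} \approx \$2667
\]

\[
\text{SATT} = \frac{50(2000) + 100(2000) + 150(4000)}{300} = \$3000
\]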
| SES | n | degree | earnings |
|---|---|---|---|
| 1 | 150 | 0 | 2000 |
| 1 | 0 | 1 | NA |
| 2 | 100 | 0 | 6000 |
| 2 | 100 | 1 | 8000 |
| 3 | 50 | 0 | 10000 |
| 3 | 150 | 1 | 14000 |
What is the ATE here? ATT?
Strata that have only treatment or control cases (not both) are called off support. Strata with both treatment and control cases are in the region of common support.
What is the effect of maternal smoking on infant health?
Data are from a subsample (N = 4642) of singleton births in PA between 1989 and 1991. See Almond et al. 2005. “The Costs of Low Birth Weight.”
d <- haven::read_dta(here("data", "cattaneo2.dta"))
d <- d |>
  haven::zap_labels() |> # remove Stata labels
  mutate( smoker = factor(mbsmoke,
                          labels = c("Nonsmoker", "Smoker")) ,
          zweight = (bweight - mean(bweight)) / sd(bweight)) |>
  select( bweight, zweight, lbweight, mbsmoke, smoker, mmarried,
          mage, medu, fbaby, alcohol, mrace, nprenatal )

vars n mean sd min max range se
bweight 1 4642 3361.68 578.82 340.00 5500.00 5160.00 8.50
zweight 2 4642 0.00 1.00 -5.22 3.69 8.91 0.01
lbweight 3 4642 0.06 0.24 0.00 1.00 1.00 0.00
mbsmoke 4 4642 0.19 0.39 0.00 1.00 1.00 0.01
mmarried 5 4642 0.70 0.46 0.00 1.00 1.00 0.01
mage 6 4642 26.50 5.62 13.00 45.00 32.00 0.08
medu 7 4642 12.69 2.52 0.00 17.00 17.00 0.04
fbaby 8 4642 0.44 0.50 0.00 1.00 1.00 0.01
alcohol 9 4642 0.03 0.18 0.00 1.00 1.00 0.00
mrace 10 4642 0.84 0.37 0.00 1.00 1.00 0.01
nprenatal 11 4642 10.76 3.68 0.00 40.00 40.00 0.05
| estimate | conf.low | conf.high |
|---|---|---|
| 0.476 | 0.404 | 0.547 |
The t-test gives us the naive estimate of the effect of smoking. If we assume the only difference between smokers and non-smokers is smoking, the effect of smoking is to reduce birthweight by .476 standard deviations (276g).
Note: later we will define formulas so we don’t have to re-type all the variables every time.
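The exact-matching output below is consistent with a call along these lines (a sketch, not the verbatim original; it assumes the MatchIt package and the `d` prepared above, and the names `ematch_out` / `ematch_data` are assumptions):

```r
library(MatchIt)

# Exact matching of smokers to non-smokers on three covariates;
# with method = "exact" the target estimand defaults to the ATT
ematch_out <- matchit(mbsmoke ~ mmarried + alcohol + mrace,
                      data = d,
                      method = "exact")
ematch_out

# Matched data with weights and subclass (stratum) indicators,
# used as `ematch_data` further below
ematch_data <- match.data(ematch_out)
```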
A matchit object
- method: Exact matching
- number of obs.: 4642 (original), 4642 (matched)
- target estimand: ATT
- covariates: mmarried, alcohol, mrace
Control Treated
All (ESS) 3778.000 864
All 3778.000 864
Matched (ESS) 2081.684 864
Matched 3778.000 864
Unmatched 0.000 0
Discarded 0.000 0
Rows: 4,642
Columns: 14
$ bweight <dbl> 3459, 3260, 3572, 2948, 2410, 3147, 3799, 3629, 2835, 3880, …
$ zweight <dbl> 0.16813549, -0.17566764, 0.36336038, -0.71469567, -1.6441734…
$ lbweight <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ mbsmoke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ smoker <fct> Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsmoker, Nonsm…
$ mmarried <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ mage <dbl> 24, 20, 22, 26, 20, 27, 27, 24, 21, 30, 26, 20, 34, 21, 23, …
$ medu <dbl> 14, 10, 9, 12, 12, 12, 12, 12, 12, 15, 12, 12, 14, 8, 12, 12…
$ fbaby <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, …
$ alcohol <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ mrace <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ nprenatal <dbl> 10, 6, 10, 10, 12, 9, 16, 11, 20, 9, 14, 5, 13, 8, 4, 10, 13…
$ weights <dbl> 0.6057834, 1.3247009, 0.6057834, 0.6057834, 0.6057834, 2.360…
$ subclass <fct> 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, …
subclass_table <- ematch_data |>
  group_by(mmarried, alcohol, mrace) |>
  summarize(
    n_t = sum(mbsmoke),                              # N treated
    n_c = sum(1 - mbsmoke),                          # N control
    zbw_t = weighted.mean(zweight, w = mbsmoke),     # mean std bw for treated
    zbw_c = weighted.mean(zweight, w = 1 - mbsmoke), # mean std bw for control
    row_diff = zbw_t - zbw_c,                        # mean treat-control diff
    wt_t = weighted.mean(weights, w = mbsmoke),      # mean treat weight
    wt_c = weighted.mean(weights, w = 1 - mbsmoke))  # mean control weight

# A tibble: 8 × 10
# Groups: mmarried, alcohol [4]
mmarried alcohol mrace n_t n_c zbw_t zbw_c row_diff wt_t wt_c
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 113 373 -0.592 -0.493 -0.0996 1 1.32
2 0 0 1 291 539 -0.391 0.0506 -0.442 1 2.36
3 0 1 0 29 14 -0.970 -0.253 -0.717 1 9.06
4 0 1 1 22 13 -0.379 -0.514 0.135 1 7.40
5 1 0 0 19 182 -0.775 -0.262 -0.512 1 0.456
6 1 0 1 362 2613 -0.252 0.205 -0.457 1 0.606
7 1 1 0 4 6 -0.824 -0.0936 -0.731 1 2.92
8 1 1 1 24 38 -0.328 0.383 -0.711 1 2.76
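The `wt_t` and `wt_c` columns appear to follow the usual ATT weighting under exact matching: treated units get weight 1, and each control unit is weighted by its stratum's treated/control ratio (MatchIt then rescales control weights to average 1 across controls):

\[
w_i =
\begin{cases}
1 & \text{if } T_i = 1 \\
n^{\text{treated}}_{s(i)} \big/ n^{\text{control}}_{s(i)} & \text{if } T_i = 0
\end{cases}
\]

For example, row 2's control weight is \(291/539 \approx 0.54\), rescaled by \(3778/864 \approx 4.37\), giving \(2.36\).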
The jury is still out on the theory behind some of these procedures, so in many cases it’s not obvious what the right way is to calculate standard errors.
We will cover bootstrapping later, which is now (once again) seen as a good option for getting standard errors.
See Iacus, King, and Porro. 2019. “A Theory of Statistical Inference for Matching Methods in Causal Research.” Political Analysis.
Matching: compare treatment/control differences in outcome on cases that are identical on other observed characteristics (i.e., matched)
Weighting: apply weights so that these stratum-specific differences are aggregated in a way that reflects the distribution of interest (e.g., ATT)
ematch_out2 <- matchit(mbsmoke ~ mmarried + alcohol + mrace +
                         fbaby + mage + medu + nprenatal ,
                       data = d,
                       method = "exact")
ematch_out2

A matchit object
- method: Exact matching
- number of obs.: 4642 (original), 1235 (matched)
- target estimand: ATT
- covariates: mmarried, alcohol, mrace, fbaby, mage, medu, nprenatal
Control Treated
All (ESS) 3778.0000 864
All 3778.0000 864
Matched (ESS) 478.5805 362
Matched 873.0000 362
Unmatched 2905.0000 502
Discarded 0.0000 0
But we don’t want a method that requires us to throw away tons of data if we don’t have to!